This paper presents a technique, based on the intuitively simple concepts of Sample Domain and Effective Prediction
Domain, for dealing with linear regression situations involving collinearity of any degree of severity. The Effective
Prediction Domain (EPD) clarifies the concept of collinearity, and leads to conclusions that are quantitative and
practically useful. The method allows for the presence of expansion terms among the regressors, and requires no changes
when dealing with such situations.
Introduction

The scientists' search for relations between measurable 
properties of materials or physical systems can be effec- 
tively helped by the statistical technique known as multiple 
regression. Even when limited to linear regression, the tech- 
nique is often of great value, as we shall see below. Often, 
however, difficulties in interpretation arise because of a 
condition called collinearity. This condition, which is inher- 
ent in the structure of the design points (the X space) of the 
regression experiment, is often treated, at least implicitly, as 
a sort of disease of the data that is to be remedied by special 
mathematical manipulations of the data. 

We consider collinearity not as a disease but rather as 
additional information provided by the data to the data ana- 
lyst, warning him to limit the use of the regression equation 
as a prediction tool to specific subspaces of the X space, and 
telling him precisely what these subspaces are. Thus, 
collinearity is an indication of limitations inherent in the 
data. The statistician's task is to detect these limitations and 
to express them in a useful manner. If this viewpoint is 
adopted, there is no need for remedial techniques. All that 
is required is a method for extracting the additional informa- 
tion from the data. We will present such a method. 

About the Author: John Mandel is a statistical consultant
serving with NBS' National Measurement Laboratory.



The Model 

We assume that measurements y have been made at a
number of "x-points," each point being characterized by the
numerical values of a number of "regressor variables" $x_j$.
We also assume that y is a linear function of the x-variables.
The mathematical model, for p regressors, is:

$$ y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_j x_j + \dots + \beta_p x_p + \epsilon \qquad (1) $$

where $\epsilon$ is the error in the y measurement. We denote by N
the number of points, or "design points", i.e., the combinations
of the x's at which y is measured.

Usually, the variable $x_1$ is identically equal to "one" for
all N points, to allow for the presence of a constant term.
Then the expected value of y, denoted E(y), is equal to $\beta_1$
when all the other x's are zero. This point, called the origin,
is seldom one of the design points and is, in fact, quite often
far removed from all design points. In many cases this point
is even devoid of physical meaning.

First Example: 
Firefly Data 

We present the problem in terms of two examples of real
data. The first data set (Buck [1])(1) is shown in table 1. It
consists of 17 points and has two regressors, in addition to
a constant term ($x_1 = 1$). The measurement is the time of the
first flash of a firefly, after 6:30 p.m. It is studied as a
function of ambient light intensity ($x_2$) and temperature ($x_3$).
Figure 1 is a plot of $x_3$ versus $x_2$. There is obviously a
trend: $x_3$ increases as $x_2$ increases. The existence of a relation
of this type between some of the regressor variables
often causes difficulties in the interpretation of the regression
analysis. To deal with the problem in a general way we
propose a method based on two concepts. The first of these
we shall call the "sample domain."

(1) Figures in brackets indicate literature references.

Table 1. Data for firefly study.

  x1     x2     x3      y
   1     26    21.1    45
   1     35    23.9    40
   1     40    17.8    58
   1     41    22.0    50
   1     45    22.3    31
   1     55    23.3    52
   1     55    20.5    54
   1     56    25.5    38
   1     70    21.7    40
   1     75    26.7    28
   1     79    25.0    38
   1     87    24.4    36
   1    100    22.3    36
   1    100    25.5    46
   1    110    26.7    40
   1    130    25.5    31
   1    140    26.7    40

Definition of Variables

y = time of first flash (number of minutes after 6:30 p.m.)
x2 = light intensity (in metercandles, mc)
x3 = temperature (°C)

For our data, the sample domain consists of the rectangle
formed by the vertical straight lines going through the lowest
and highest $x_2$ of the experiment, respectively, and by the
horizontal straight lines going through the lowest and
highest $x_3$, respectively (see fig. 1). The concept is readily
generalized to an X space of any number of dimensions, and
becomes a hyper-rectangle in such a space. Note that the vertex
B of the sample domain is relatively far from any of the
design points. This has important consequences.

Figure 1. Sample domain.

The regression equation

$$ \hat{y} = \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_3 \qquad (2) $$

allows us to estimate y at any point ($x_1$, $x_2$, $x_3$) (we recall that
$x_1 = 1$) and to estimate the variance of $\hat{y}$ at this point. The
point can be inside or outside the sample domain. Obviously
the variance of $\hat{y}$, which we denote by Var($\hat{y}$), will tend to
become larger as the point for which the prediction is made
is further away from the cluster of points involved in the
experiment. Therefore Var($\hat{y}$) at the point B may be considerably
larger than at points A, C, and D. Such a condition
is associated with the concept of "collinearity." We define
collinearity, in a semi-quantitative way, as the condition
that arises when for at least one of the vertices of the sample
domain, Var($\hat{y}$) is considerably larger than for the other
vertices. The concept will become clearer as we proceed.
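The sample domain of the firefly regressors can be sketched in a few lines. The function names (`sample_domain`, `domain_vertices`, `nearest_distance`) and the range-scaled distance used to judge how isolated a vertex is are illustrative assumptions, not part of the paper's method.

```python
import itertools
import numpy as np

# Firefly regressors from table 1 (the constant x1 = 1 is omitted here).
x2 = np.array([26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87,
               100, 100, 110, 130, 140], dtype=float)            # light intensity
x3 = np.array([21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
               26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7])  # temperature
X = np.column_stack([x2, x3])

def sample_domain(X):
    """Lower and upper bounds of the axis-parallel box enclosing the rows of X."""
    return X.min(axis=0), X.max(axis=0)

def domain_vertices(lower, upper):
    """All 2**p corners of the box."""
    return [np.array(corner) for corner in itertools.product(*zip(lower, upper))]

lower, upper = sample_domain(X)
vertices = domain_vertices(lower, upper)

# Assumed, illustrative measure: distance from a vertex to the nearest design
# point, with each axis scaled by its range so the two units are comparable.
span = upper - lower

def nearest_distance(v):
    d = (X - v) / span
    return float(np.sqrt((d ** 2).sum(axis=1)).min())

gaps = [nearest_distance(v) for v in vertices]
farthest = vertices[int(np.argmax(gaps))]
print(lower, upper, farthest)
```

With these data the most isolated corner is the one at high light intensity and low temperature, which is consistent with the remark about vertex B.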

At any rate, the larger variance at one of the vertices of 
the sample domain is generally the lesser of two concerns, 
the other being that the regression equation, for which valid- 
ity may have been reasonably firmly established in the vicin- 
ity of the cluster of experimental points, may no longer be 
valid at a more distant point. It is important to note that the 
evidence from the data alone cannot justify inferences at 
such distant points. In order to validate prediction at such 
points, it is necessary to introduce either additional data or 
additional assumptions. 

For these reasons, we seek to establish a region in the
X-space for which prediction is reasonably safe on the basis
of the experiment alone. We call this the Effective Prediction
Domain, or EPD.

The EPD is the second concept required for our treatment 
of collinear data. It is closely related to the first concept, the 
sample domain, as will be shown below. 

Establishing the EPD 

Our procedure consists of two steps, involving two successive
transformations of the coordinate system. The original
coordinate system in which the x-regressors are expressed
is referred to as the X-system.

1. The Z System

The first step consists of a translation of the X-system
(parallel to itself) to a different origin, located centrally
within the cluster of experimental points (centering), together
with a rescaling of each x to a standard scale.
The new system, called the Z-system, is given by the equations(2)

$$ \text{For } j = 1: \quad z_1 = K \ \text{(a constant)} \qquad (3a) $$

$$ \text{For } j > 1: \quad z_j = \frac{x_j - C_j}{R_j} \qquad (3b) $$

(2) We assume that in the X-system, the regressor $x_1$ is identically equal to
unity, to allow for an independent term.

For $C_j$ and $R_j$ we consider two choices, which we call the
Correlation Scale Transformation (CST) and the Range
Midrange Transformation (RMT). We discuss first the Correlation
Scale Transformation defined by the choice

$$ C_j = \bar{x}_j, \qquad R_j = \sqrt{\sum_i (x_{ij} - \bar{x}_j)^2} \qquad (4) $$

where i = 1 to N.

It easily follows from (3b) that

$$ \sum_i z_{ij} = 0, \qquad \sum_i z_{ij}^2 = 1 \qquad (5) $$

It is then reasonable to choose a value K in (3a) equal to

$$ K = 1/\sqrt{N} \qquad (6) $$

so as to make $\sum_i z_{i1}^2 = 1$.

The values of $C_j$ and $R_j$ for the firefly data are given in
table 2. Contrary to statements found in the literature (see
discussion at end of this paper), the centering and rescaling
defined by the Correlation Scale Transformation have no
effect whatsoever on collinearity. The location of the sample
domain relative to the design points remains unchanged,
though it is expressed in different coordinates.

Table 2. Firefly data: parameters for correlation scale transformation.

  j         C_j             R_j
  1                        4.123106
  2      73.176471       135.264447
  3      23.582353        10.073962

To arrive at an EPD, a second operation is necessary, viz.
a rotation of the Z-coordinate system to a new coordinate
system, which we shall call the W-system (of coordinates).

2. The W-System

The rotation from Z to W is accomplished by the method
of Principal Components, or its equivalent, the Singular
Value Decomposition (SVD). For a discussion of this
method the reader is referred to Mandel [2]. Here we merely
recall a few facts. Each w-coordinate is a linear combination
of all z-coordinates given by the matrix equation:

$$ W = ZV \qquad (7) $$

where V is an orthogonal matrix.
In algebraic notation, eq (7) becomes

$$ w_{ik} = \sum_j z_{ij} v_{kj}, \qquad i = 1 \text{ to } N, \quad j = 1 \text{ to } p \qquad (8) $$

where the $v_{kj}$ are the elements of the V matrix. The $v_{kj}$, for
a given k, are simply the direction cosines of the $w_k$ axis
with respect to the Z-system. Consequently,

$$ \sum_j v_{kj}^2 = 1 \qquad (9) $$
Since the rotation is orthogonal, any two distinct w-axes,
say $w_k$ and $w_{k'}$, are orthogonal and consequently:

$$ \sum_j v_{kj} v_{k'j} = 0 \qquad \text{for } k \ne k' \qquad (10) $$



For the firefly data, the V matrix is shown in table 3, and the 
complete set of z and w coordinates is given in table 4. 

Note that row 2, as well as column 1, in table 3 consists
of the element "one" in one cell and zeros in all other cells.
This is a consequence of the orthogonality of $z_1$ with respect
to all $z_j$ with j > 1. This orthogonality is in turn due to the
nature of the Correlation Scale Transformation, as expressed
by eq (4).
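The Correlation Scale Transformation of eqs (3a), (3b), (4) and (6) can be checked numerically on the firefly data. The sketch below is illustrative; the helper name `cst` is an assumption, and the constant column is built directly from $K = 1/\sqrt{N}$.

```python
import numpy as np

# Firefly regressors from table 1.
x2 = np.array([26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87,
               100, 100, 110, 130, 140], dtype=float)
x3 = np.array([21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
               26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7])
N = x2.size

def cst(x):
    """Correlation Scale Transformation of one regressor, eqs (3b) and (4)."""
    C = x.mean()                          # C_j = mean of x_j
    R = np.sqrt(((x - C) ** 2).sum())     # R_j = root of the centered sum of squares
    return (x - C) / R, C, R

z2, C2, R2 = cst(x2)
z3, C3, R3 = cst(x3)
z1 = np.full(N, 1.0 / np.sqrt(N))         # eqs (3a) and (6)

Z = np.column_stack([z1, z2, z3])
ZtZ = Z.T @ Z   # unit diagonal; off-diagonal terms involving z1 vanish

print(C2, R2, C3, R3)
print(np.round(ZtZ, 4))
```

The zero entries in the first row and column of `ZtZ` are the orthogonality of $z_1$ to the other z's noted above; the remaining off-diagonal entry is the ordinary correlation coefficient between $x_2$ and $x_3$.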

At the bottom of the w columns we find values labeled $\lambda_j$.
They are simply the sums of squares of all w-values in that
column.

$$ \lambda_j = \sum_i w_{ij}^2 \qquad (11) $$
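The rotation to the W-system and the relation between the $\lambda$ values and the column sums of squares can be reproduced as follows. This is a sketch that uses an eigen-decomposition of $Z'Z$ from NumPy rather than whatever SVD routine was originally used; the signs of the computed eigenvectors are arbitrary and may differ from those printed in table 3.

```python
import numpy as np

# Firefly regressors from table 1.
x2 = np.array([26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87,
               100, 100, 110, 130, 140], dtype=float)
x3 = np.array([21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
               26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7])
N = x2.size

def cst(x):
    """Center at the mean and scale by the root of the centered sum of squares."""
    c = x.mean()
    return (x - c) / np.sqrt(((x - c) ** 2).sum())

Z = np.column_stack([np.full(N, 1 / np.sqrt(N)), cst(x2), cst(x3)])

# eigh returns eigenvalues in ascending order; reorder to decreasing lambda.
lam, vecs = np.linalg.eigh(Z.T @ Z)
order = np.argsort(lam)[::-1]
lam, vecs = lam[order], vecs[:, order]

# Column k of `vecs` holds the direction cosines of the w_k axis
# (row k of table 3), so W = Z V is simply:
W = Z @ vecs

col_ss = (W ** 2).sum(axis=0)   # sums of squares of each w column, eq (11)
print(np.round(lam, 4))
print(np.round(col_ss, 4))
```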



Table 3. Firefly data: V matrix.

                   j
  k        1         2         3
  1        0       .7071     .7071
  2      1.000       0         0
  3        0      -.7071     .7071


Table 4. Firefly data: z and w coordinates (CST).(1)

  Point      z2        z3        w1        w3
    1      -.3488    -.2464    -.4216     .0724
    2      -.2822     .0315    -.1780     .2219
    3      -.2453    -.5740    -.5800    -.2324
    4      -.2379    -.1571    -.2800     .0572
    5      -.2083    -.1273    -.2381     .0573
    6      -.1344    -.0280    -.1156     .0753
    7      -.1344    -.3060    -.3121    -.1213
    8      -.1270     .1904     .0440     .2245
    9      -.0235    -.1869    -.1495    -.1155
   10       .0135     .3095     .2276     .2094
   11       .0431     .1407     .1292     .0691
   12       .1022     .0812     .1289    -.0148
   13       .1983    -.1273     .0495    -.2302
   14       .1983     .1904     .2741    -.0055
   15       .2722     .3095     .4106     .0264
   16       .4201     .1904     .4309    -.1624
   17       .4940     .3095     .5674    -.1304

                          λ1 = 1.6549   λ3 = .3451

(1) $z_{i1} = 1/\sqrt{17} = .2425$ for all i; $w_{i2} = 1/\sqrt{17} = .2425$ for all i, $\lambda_2 = 1.0000$.









The $\lambda_j$ are also the eigenvalues of the Z'Z matrix which,
for our choice of $C_j$ and $R_j$, is the correlation matrix of the
regressors x. Note that $w_2$ is the constant $1/\sqrt{N}$. Consequently

$$ \sum_i (w_{i2})^2 = 1 $$

We need to consider $w_1$ and $w_3$ only. A similar situation
applies to the z coordinates, where $z_{i1} = 1/\sqrt{N}$ for all i.
Figure 2 shows both the z-coordinates ($z_2$ and $z_3$) and the
w-coordinates ($w_1$ and $w_3$) for the firefly data. The order of
the w-coordinates ($w_1$, $w_2$, $w_3$) is that of the corresponding
$\lambda$-values, in decreasing order.



3. The Effective Prediction Domain (EPD) 

The EPD is simply the sample domain corresponding to
the W-system of coordinates. Thus, straight lines parallel to
the $w_3$ axis are drawn through the smallest and largest $w_1$,
respectively, and lines parallel to the $w_1$ axis are drawn
through the smallest and largest $w_3$. Here again generalization
is readily made to a p-dimensional W-space. The EPD
for the firefly data is also shown in figure 2.

The interpretation of EPD is straightforward. Unlike the 
sample domain in either the X-system or the Z-system, the 
EPD excludes points that are distant from the cluster of 
regressor points. This has two advantages. In the first place, 
the use of the regression equation is justified for all points 
inside, and on the periphery of the EPD. And accordingly, 
the variance of the predicted value y for any such point will 
not be unduly large. These statements require more detailed 
treatment. To this effect we introduce the concept of vari- 
ance factor (VF). 
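A membership test for the EPD follows directly from this definition: transform a candidate point to w-coordinates and compare each coordinate with the smallest and largest value found among the design points. The sketch below uses the CST and the firefly data; the function names (`cst_design`, `to_w`, `in_epd`) and the small numerical tolerance are illustrative assumptions.

```python
import numpy as np

# Firefly regressors from table 1.
x2 = np.array([26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87,
               100, 100, 110, 130, 140], dtype=float)
x3 = np.array([21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
               26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7])
X = np.column_stack([x2, x3])
N = X.shape[0]

def cst_design(X):
    """Z matrix (constant column first) and the CST parameters C, R."""
    C = X.mean(axis=0)
    R = np.sqrt(((X - C) ** 2).sum(axis=0))
    Z = np.column_stack([np.full(len(X), 1 / np.sqrt(len(X))), (X - C) / R])
    return Z, C, R

Z, C, R = cst_design(X)
lam, V = np.linalg.eigh(Z.T @ Z)       # ascending eigenvalues; columns of V are axes
W = Z @ V
w_lo, w_hi = W.min(axis=0), W.max(axis=0)   # limits of the EPD along each w axis

def to_w(point):
    """w-coordinates of an arbitrary point given in the original x2, x3 units."""
    z = np.concatenate([[1 / np.sqrt(N)], (np.asarray(point, float) - C) / R])
    return z @ V

def in_epd(point, tol=1e-9):
    """True if every w-coordinate lies within the range spanned by the design points."""
    w = to_w(point)
    return bool(np.all(w >= w_lo - tol) and np.all(w <= w_hi + tol))

print(in_epd((140, 17.8)))     # far corner of the sample domain
print(in_epd(X.mean(axis=0)))  # centroid of the design points
```

Every design point is inside the EPD by construction, while the far corner of the sample domain (high light intensity, low temperature) is not.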



4. The Variance Factor (VF) 

From regression theory we know that the variance of any
linear function, say L, of the coefficient estimates $\hat\beta_j$ is of the
form:

$$ \mathrm{Var}(L) = f(X)\cdot\sigma_\epsilon^2 \qquad (12) $$

where $\sigma_\epsilon^2$ is the variance of the experimental errors $\epsilon$ of the
y measurements. The multiplier f(X) is independent of the
y and depends only on the X matrix and on the coefficients
in the L function. We call this multiplier the variance factor, VF.
Thus, we have:

$$ \mathrm{Var}(\hat\beta_j) = \mathrm{VF}(\hat\beta_j)\cdot\sigma_\epsilon^2 \qquad (13) $$

and

$$ \mathrm{Var}(\hat{y}) = \mathrm{VF}(\hat{y})\cdot\sigma_\epsilon^2 \qquad (14) $$

In eq (14), $\hat{y}$ is the estimated, or predicted, y value at any
chosen point in X-space. VF($\hat{y}$) is of course a function of
the location of this point.

Figure 2. EPD for firefly data. (Horizontal axis: light intensity $x_2$, marked from 26 to 140; vertical axis: temperature $x_3$, marked from 17.8 to 26.7.)

Returning now to our statements above, it is well-known 
that a regression equation can show excellent (very small) 
residuals and yet be very poor for certain prediction pur- 
poses. The small residuals merely mean that a good fit has 
been obtained at the points used in the experiment . This is 
no guarantee that the fit is good at other points. However, 
if the regression equation is scientifically reasonable, it is 
likely that the experimental situation underlying it will also 
be valid for points that are close to the cluster of the regres- 
sor points used in the experiment. Every point in the EPD 
satisfies this requirement. 

Furthermore, the variance of prediction, measured by the 
VF, will also be reasonably small for all points of the EPD, 



simply because they are geometrically close to the design 
points. 

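The calculation of VF($\hat{y}$) is quite simple, once the V-matrix
and the $\lambda$ values have been calculated. It is based on
the equation

$$ \mathrm{VF}(\hat{y}) = \sum_k u_k^2 \qquad (15) $$

where $u_k$ is defined, for the point at which the prediction is made, as:

$$ u_k = \frac{w_k}{\sqrt{\lambda_k}} \qquad (16) $$

Combining eqs (8) and (16), we obtain

$$ u_k = \frac{1}{\sqrt{\lambda_k}} \sum_j z_j v_{kj} \qquad (17) $$

and hence:

$$ \mathrm{VF}(\hat{y}) = \sum_k \frac{1}{\lambda_k} \Big( \sum_j z_j v_{kj} \Big)^2 \qquad (18) $$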



Figure 3 shows the VF values at the vertices of the orig- 
inal sample domain and of the EPD. Interpreting these re- 
sults, we see that the collinearity of our data is reflected in 
the rejection of an appreciable portion of the sample domain 
for purposes of safe prediction. This does not mean that 
prediction outside the EPD is impossible, or unacceptable. 
It merely means that such prediction cannot be justified on 
the basis of the data alone. Of course, the risk of predicting 
outside the EPD increases with the distance from the EPD. 
It will generally be reasonably safe to use the regression 
equation even outside the EPD, as long as the point for 
which prediction is made is reasonably close to the borders 
of the EPD. Using eq (18), the VF for any contemplated 
prediction point is readily calculated and can serve as a basis 
for decision. 
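The variance factor at the four corners of the firefly sample domain can be computed as the sum over k of $w_k^2/\lambda_k$, where $w_k$ are the w-coordinates of the prediction point. That form is assumed here from standard least-squares theory (it equals $z'(Z'Z)^{-1}z$); the function name `variance_factor` is illustrative, and the correspondence between corners and the letters A to D is inferred only from the agreement with the values quoted for figure 3.

```python
import numpy as np

# Firefly regressors from table 1.
x2 = np.array([26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87,
               100, 100, 110, 130, 140], dtype=float)
x3 = np.array([21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
               26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7])
X = np.column_stack([x2, x3])
N = X.shape[0]

# Correlation Scale Transformation and rotation to the W-system.
C = X.mean(axis=0)
R = np.sqrt(((X - C) ** 2).sum(axis=0))
Z = np.column_stack([np.full(N, 1 / np.sqrt(N)), (X - C) / R])
lam, V = np.linalg.eigh(Z.T @ Z)

def variance_factor(point):
    """VF of the predicted value at `point` (x2, x3): sum_k w_k**2 / lambda_k."""
    z = np.concatenate([[1 / np.sqrt(N)], (np.asarray(point, float) - C) / R])
    w = z @ V
    return float((w ** 2 / lam).sum())

corners = [(26, 17.8), (140, 17.8), (26, 26.7), (140, 26.7)]
for c in corners:
    print(c, round(variance_factor(c), 2))
```

The four values agree to two decimals with the sample-domain entries .39, 1.71, .69 and .30 quoted for figure 3, the largest belonging to the corner at (140, 17.8).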



Second Example: 
Calibration for Protein Determination 

The instructive and intuitively satisfying graphical dis- 
play of the EPD becomes impossible when the number of 
regressors, including the independent term, exceeds 3. We 
must then replace the graphical procedure by an analytical 



one, as will now be shown in the treatment of our second 
example. 

The data were presented by Fearn [3], in a discussion of
Ridge Regression. They represent the linear regression of
percent protein, in ground wheat samples, on near-infrared
reflectance at six different wavelengths.

For reasons of simplicity in presentation, we include here 
only three of the six wavelengths, a change that has a rather 
small effect on the final outcome of the analysis: it turns out 
that the regression equation based on these 3 wavelengths is 
very nearly as precise as that based on 6 wavelengths. 

The data, displayed in table 5, are a very good example 
of the use of regression equations: the regression equation is 
indeed to be used as a "calibration curve" for the analysis of 
protein, using the rapid spectrometry instead of the far more 
time-consuming Kjeldahl nitrogen determination. Our data 
have an N value of 24, and p (including the independent 
term) is 4. 

Table 6 exhibits the correlation matrix of the 24 design 
points. It is very apparent that the x values at all three 
wavelengths are highly correlated with each other, thus indi- 
cating a high degree of collinearity. At a first glance one 
would be very skeptical about such a set of data, and suspect 
that the X matrix shows such a high degree of redundancy 
as to make the regression useless for prediction purposes. 
Fearn explains that the correlations are more a reflection of 
particle size variability than of protein content. Our analysis 
will confirm that, properly interpreted, the data lead to a 
very satisfactory calibration procedure. 

We will find it useful to introduce a slightly different Z 
transformation, which we call the Range-Midrange Trans- 
formation . 




Figure 3. VF at vertices of sample domain and of EPD.

  Sample domain          EPD
  Vertex     VF          Vertex     VF
    A        .39           a        .41
    B       1.71           b        .41
    C        .69           c        .41
    D        .30           d        .40
Table 5. Protein calibration data.(*)

               Reflectance          % Protein
  Point     x2      x3      x4         y
    1      246     374     386        9.23
    2      236     386     383        8.01
    3      240     359     353       10.95
    4      236     352     340       11.67
    5      243     366     371       10.41
    6      273     404     433        9.51
    7      242     370     377        8.67
    8      238     370     353        7.75
    9      258     393     377        8.05
   10      264     384     398       11.39
   11      243     367     378        9.95
   12      233     365     365        8.25
   13      288     415     443       10.57
   14      293     421     450       10.23
   15      324     448     467       11.87
   16      271     407     451        8.09
   17      360     484     524       12.55
   18      274     406     407        8.38
   19      260     385     374        9.64
   20      269     389     391       11.35
   21      242     366     353        9.70
   22      285     410     445       10.75
   23      255     376     383       10.75
   24      276     396     404       11.47



(*) $x_1 = 1$.

Table 6. Protein calibration data: correlation matrix of $x_2$ through $x_4$.

           x2        x3        x4
  x2       1        .9843     .9337
  x3                 1        .9545
  x4                           1

The Range-Midrange Transformation

The Range-Midrange Transformation (RMT) is defined
as follows:

$$ \text{For } j = 1: \quad z_1 = 1 \qquad (19a) $$

$$ \text{For } j > 1: \quad z_j = \frac{x_j - C_j}{R_j} \qquad (19b) $$

but now $C_j$ is defined as the midrange of the N values of $x_j$
and $R_j$ is one-half the range of these values. With these
definitions, it is clear that the smallest z-value, for any
regressor, is (-1) and the largest z-value is (+1). It is
because of this -1 to +1 scale that this transformation was
introduced. The benefits of this scale will become apparent
in the following section.

EPD for the Protein Data

The EPD resulting from the Singular Value Decomposition
based on the Range-Midrange Transformation will not
be the same as the EPD we would have obtained using the
Correlation Scale Transformation, but we will see that those
features of the EPD that are of importance for us, in establishing
the limitations of the regression equation, are practically
unaffected.

Table 7 shows the C and R values for the four regressors
and table 8 exhibits the V matrix and the $\lambda$ values obtained
from the Singular Value Decomposition. The latter, it may
be recalled, simply expresses the rotation of the Z coordinate
system to the W system.

Table 7. Protein calibration data: parameters for Z transformation (RMT).

  j        C        R
  1                 1
  2      296.5     63.5
  3      418.0     66.0
  4      432.0     92.0

Table 8. Protein calibration data: V matrix and λ values (RMT).

                        j
  k        1         2         3         4          λ
  1     -.6665     .4845     .4217     .3784     43.7810
  2      .7365     .3299     .3797     .4523      8.3782
  3     -.1096    -.5491    -.2509     .7896       .3758
  4     -.0332    -.5958     .7843    -.1698       .06624

For each $w_k$ coordinate, there are 24 values, corresponding
to the 24 regressor points.

Table 9 shows the smallest and the largest $w_k$ value, for
each of the four k.

Table 9. Protein calibration data: limits defining the EPD.

  Coordinate    Smallest w    Largest w
     w1          -1.9282        .6181
     w2           -.4097       1.8989
     w3           -.1669        .3158
     w4           -.0801        .1324

According to table 9, we must have, in the EPD:

$$ -1.9282 \le w_1 \le .6181 \qquad (20) $$

with similar statements for $w_2$, $w_3$, and $w_4$. Applying now eq
(8), this double inequality can be written:

$$ -1.9282 \le -.6665\,z_1 + .4845\,z_2 + .4217\,z_3 + .3784\,z_4 \le .6181 $$

Since $z_1$ is constant and equal to 1, this double inequality becomes:

$$ -1.2617 \le .4845\,z_2 + .4217\,z_3 + .3784\,z_4 \le 1.2846 \qquad (21a) $$

With the RMT, the value of any $z_k$ is, for any k > 1, between
(-1) and (+1). Thus the expression in the middle has, for
all design points, a value between -1.2846 and 1.2846,
where 1.2846 is the sum of the absolute values of the three
coefficients. Therefore, the double inequality expressed by
eq (21a) holds, essentially, for every point in the original
sample domain. Thus, $w_1$, the first coordinate of the EPD,
which represents its largest dimension, imposes essentially
no restrictions on the sample domain.

Doing the same calculations for the three other
w-coordinates (see table 9), we obtain, respectively:

$$ -1.1462 \le .3299\,z_2 + .3797\,z_3 + .4523\,z_4 \le 1.1619 \qquad (21b) $$

$$ -.0568 \le -.5491\,z_2 - .2509\,z_3 + .7896\,z_4 \le .4254 \qquad (21c) $$

$$ -.0469 \le -.5958\,z_2 + .7843\,z_3 - .1698\,z_4 \le .1656 \qquad (21d) $$

We see that $w_2$, too, imposes only very light restrictions on
the sample domain. On the other hand, $w_3$ and $w_4$ do imply
limitations that eliminate appreciable portions of the sample
domain from the EPD.

We could readily convert eqs (21c) and (21d) to x coordinates
by means of table 7 and eqs (19a) and (19b), but the
z-coordinates, using the Range-Midrange Transformation,
are more readily interpreted in terms of the severity of
collinearity than the x-coordinates.

Thus, the sums of the absolute values of the coefficients in
the middle terms of (21c) and (21d) are 1.5896 and 1.5499,
respectively. Points for which these linear combinations
take the values ±1.5896 and ±1.5499 exist in the original
sample domain. The EPD, on the other hand, limits these
functions to intervals with much narrower limits.

Effect of Type
of Z Transformation

We have used two different Z transformations, the Correlation
Scale and the Range-Midrange. It is proper to ask
how our results would have been affected in the Protein
Calibration Data, had we used the Correlation Scale instead of
the Range-Midrange Transformation. We show the comparison
in table 10. Let us recall that with the CST, one of
the w coordinates yields a $\lambda$-value of unity, and a constant
w value for all points. Therefore, we obtain for CST only
three sets of inequalities, as compared to the four sets for
RMT. To allow the comparison between the two transformations
to be made, we have multiplied eqs (21a) through
(21d) by positive constants, so as to make the coefficient of
$z_4$ equal to ±1. The same was done for the corresponding
inequalities obtained by the Correlation Scale Transformation.

Table 10. Protein calibration data: effect of Z transformation.(1)

  w coordinate   Z transf.   Inequalities
       1           CST       -3.034 ≤ 1.021 z2 + 1.061 z3 + z4 ≤ 3.082
                   RMT       -3.334 ≤ 1.280 z2 + 1.114 z3 + z4 ≤ 3.395
       2           CST
                   RMT       -2.534 ≤ .729 z2 + .840 z3 + z4 ≤ 2.569
       3           CST       -.075 ≤ -.686 z2 - .321 z3 + z4 ≤ .535
                   RMT       -.072 ≤ -.695 z2 - .318 z3 + z4 ≤ .539
       4           CST       -.278 ≤ -3.531 z2 + 4.640 z3 - z4 ≤ .980
                   RMT       -.276 ≤ -3.509 z2 + 4.619 z3 - z4 ≤ .975

(1) All inequalities are expressed in RMT z coordinates.
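The RMT inequalities can be examined numerically from the printed V matrix (table 8) and EPD limits (table 9). The sketch below moves the constant term to the bounds, computes the largest value each middle expression can reach inside the sample domain (the sum of absolute coefficients), and rescales each inequality so that the coefficient of $z_4$ is ±1. The `retained` fraction is an illustrative summary introduced here, not a quantity defined in the paper.

```python
import numpy as np

# Rows k = 1..4 of the V matrix and the EPD limits, as printed in tables 8 and 9 (RMT).
V = np.array([[-.6665,  .4845,  .4217,  .3784],
              [ .7365,  .3299,  .3797,  .4523],
              [-.1096, -.5491, -.2509,  .7896],
              [-.0332, -.5958,  .7843, -.1698]])
w_min = np.array([-1.9282, -.4097, -.1669, -.0801])
w_max = np.array([  .6181, 1.8989,  .3158,  .1324])

# With z1 = 1, the constant term v_k1 moves to the bounds.
lower = w_min - V[:, 0]
upper = w_max - V[:, 0]
coef = V[:, 1:]                       # coefficients of z2, z3, z4

# Largest magnitude the middle expression can take when every |z_j| <= 1.
reach = np.abs(coef).sum(axis=1)

# Assumed illustrative summary: share of the attainable interval [-reach, +reach]
# that survives the EPD limits.
retained = (np.minimum(upper, reach) - np.maximum(lower, -reach)) / (2 * reach)

# Rescale by the positive constant 1/|coefficient of z4|.
scale = 1 / np.abs(coef[:, 2])
scaled_coef = coef * scale[:, None]
scaled_lower, scaled_upper = lower * scale, upper * scale

print(np.round(reach, 4))
print(np.round(retained, 3))
print(np.round(scaled_coef, 3))
```

On these figures $w_1$ and $w_2$ retain nearly the whole attainable interval, while $w_3$ and $w_4$ retain only a small part of it, which is the distinction drawn in the text.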

Of course, since the z coordinates are different for the two 
transformations, the inequalities for the CST, expressed in 
the CST z -units, had to be converted to RMT z -units, for a 
meaningful comparison. As can be seen from table 10, the 
two smallest dimensions of the EPD are practically the same 
for the two transformations. Thus, even though the method 
of principal components is not invariant with respect to 
linear transformations of scale, our analysis leads, in this 
case, to very similar results for the small dimensions of the 
EPD. We believe that this is generally true for all situations 
in which collinearity is noticeable, i.e., for all situations in 
which the EPD eliminates considerable portions of the orig- 
inal sample domain. For situations in which this does not 
apply, i.e., totally non-collinear cases, the inequalities do 
not matter, since they impose no restrictions on the sample 
domain. 

It is interesting to contrast the remarkable similarity between
the inequalities for $w_3$ and $w_4$ for the two transformations
in table 10, with the behavior of a commonly advocated
measure of collinearity (Belsley, Kuh, and Welsch
[4]), the condition number.

The SVD resulting from the CST yields the following
eigenvalues: 2.9151, 1.0000, .07176, .01312. The condition
number is defined as the ratio of the largest to the
smallest eigenvalue. In this case:

condition number = 2.9151/.01312 = 222.2

On the other hand, the SVD resulting from the RMT on the
same data yields the eigenvalues: 43.7810, 8.3782, .37575,
.066244. This time we have:

condition number = 43.7810/.066244 = 660.9.

Thus the condition number varies considerably when the
data are subjected to different standardizing transformations.
It is not clear what useful information can be derived
from the condition number.
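The two ratios are easily reproduced from the quoted eigenvalues. The function name `condition_number` is illustrative, and it follows the definition used in this paper (ratio of largest to smallest eigenvalue).

```python
# Eigenvalues quoted in the text for the two standardizing transformations.
eig_cst = [2.9151, 1.0000, 0.07176, 0.01312]
eig_rmt = [43.7810, 8.3782, 0.37575, 0.066244]

def condition_number(eigenvalues):
    """Ratio of the largest to the smallest eigenvalue, as defined in the text."""
    return max(eigenvalues) / min(eigenvalues)

print(round(condition_number(eig_cst), 1))
print(round(condition_number(eig_rmt), 1))
```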

By contrast, the treatment of collinearity we advocate has 
a useful and readily understood interpretation: the EPD is 
that part of the X space in which, and near which, prediction 
is safe. It also indicates what portions of the original sample 
domain are inappropriate for prediction on the basis of the 
given data alone . It fulfills this function in a way which is 
practically invariant with respect to intermediate transfor- 
mations of scale. We use the qualifier "intermediate" be- 
cause collinearity has meaning only in terms of a given 
original coordinate system (the X system). This system, 
which determines the original sample domain, must be con- 
sidered fixed. On the other hand, transformations of this 
system prior to calculating the EPD can be defined in differ- 
ent ways without affecting the practical inferences drawn 
from the data on the basis of the final EPD derived from the
standardizing transformation.

Cross-Validation 

We can take advantage of the availability of a second set
of protein calibration data, also given in Fearn [3], to verify
the correctness of our approach. Fearn lists 26 additional
points for which the reflectance measurements, as well as
the Kjeldahl nitrogen determination, were made. We applied
the Z transformation obtained above (RMT on first set
of 24 points) to each of these 26 points, and noted every
point for which at least one of the four sets of inequalities
(21a) through (21d) failed to be satisfied. We found 14 such
points. This means that 14 "future points" obtained under
the same test conditions were outside the EPD established
on the basis of the original 24 points. However, as we
observed above, as long as the point is not far from the EPD, 
prediction at that point is likely to be valid. We tested 
"predictability" at these 14 points by calculating the VF 
value for each of them, and by comparing the predicted 
protein value with the measured one. The results are shown 
in table 11. It is apparent that all VF are relatively small, 
indicating that even though these 14 points are outside the 
EPD calculated from the original set, they are not far from 
that EPD. This is confirmed by the good agreement between 
the observed and predicted values. The standard deviation 
of fit for the original set of 24 points was 0.23; the standard 
deviation for a single measurement derived from the 14 
differences in table 11 is 0.30. 
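The quoted value of 0.30 can be recovered from the 14 observed and predicted values of table 11 under one particular reading: each difference is treated as the difference of two determinations of equal variance, so that the standard deviation of a single measurement is $\sqrt{\sum d^2 / (2n)}$. That reading is an assumption made here because it reproduces the quoted figure; the paper does not spell out the formula.

```python
import math

# Observed and predicted % protein for the 14 points of table 11.
observed = [8.66, 11.77, 10.46, 12.03, 9.43, 8.66, 14.44,
            10.41, 11.69, 12.19, 11.59, 8.60, 9.34, 10.89]
predicted = [9.53, 11.97, 10.96, 11.47, 9.54, 8.15, 13.99,
             10.17, 11.24, 11.83, 11.39, 8.39, 8.93, 10.94]

differences = [o - p for o, p in zip(observed, predicted)]
n = len(differences)

# Assumed formula: each difference has variance 2*sigma**2, hence the factor 2n.
sd_single = math.sqrt(sum(d * d for d in differences) / (2 * n))
print(round(sd_single, 2))
```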

Table 11. Protein calibration data: cross-validation of analysis.

                  % Protein
  Point(1)   Observed   Predicted     VF
     1         8.66        9.53      .281
     4        11.77       11.97      .416
     6        10.46       10.96      .193
     9        12.03       11.47      .212
    10         9.43        9.54      .762
    11         8.66        8.15      .454
    12        14.44       13.99      .881
    14        10.41       10.17      .468
    16        11.69       11.24      .472
    17        12.19       11.83      .390
    18        11.59       11.39      .314
    20         8.60        8.39      .201
    22         9.34        8.93      .151
    26        10.89       10.94      .741

(1) Point in additional set (Fearn [3]) with its number designation in that set.

Expansion Terms

Quite frequently, a regression equation contains x variables
that are non-linear functions of one or more of the
other x variables, such as $x_2^2$, $x_2 \cdot x_3$, etc. Polynomial regressions
are necessarily of this type. Since the x variables are
non-stochastic in the usual regression models, the least
squares solution for the regression equation is not affected
by the presence of such "expansion terms." On the other
hand, collinearity can be introduced, or removed, or modified
by them.

In our treatment the expansion terms cause no additional 
problems. Consider for example, the regression 



$$ y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \epsilon \qquad (22) $$

with $x_1 = 1$.



Here we have p = 3. Using RMT, followed by a singular
value decomposition, we obtain an EPD of three dimensions,
leading to the inequalities

$$ A_1 \le w_1 \le B_1, \qquad A_2 \le w_2 \le B_2, \qquad A_3 \le w_3 \le B_3 \qquad (23) $$

Expressing the w as functions of the z, this leads to three
double inequalities governing the z, of the form

$$ A_1 \le f_1(z) \le B_1, \qquad A_2 \le f_2(z) \le B_2, \qquad A_3 \le f_3(z) \le B_3 $$

Now, since $x_3 = x_2^2$, we have

$$ z_3 = \frac{x_3 - C_3}{R_3} = \frac{x_2^2 - C_3}{R_3} = \frac{(R_2 z_2 + C_2)^2 - C_3}{R_3} $$

Hence:

$$ z_3 = \frac{C_2^2 - C_3}{R_3} + \frac{2 C_2 R_2}{R_3}\, z_2 + \frac{R_2^2}{R_3}\, z_2^2 \qquad (24) $$



Because of this relation the functions $f_1(z)$, $f_2(z)$, $f_3(z)$ become
functions of $z_1$, $z_2$ (and $z_2^2$) only. Using this fact, we
interpret the three sets of inequalities (23) exactly as we
have interpreted eqs (21a) through (21d) by determining
which of these inequalities, if any, impose restrictions on
the use of the original sample domain.

To illustrate this procedure, consider the small set of 
artificial data shown in table 12, for which the model is 
given at the bottom of the table. The term Xj=x 2 introduces 
a high correlation between x 2 and x 3 and consequently also 
considerable collinearity. 

The inequalities characterizing the EPD, based on a
Range-Midrange Transformation and converted to the z-scales,
are shown in table 13. Applying eq (24) to express
$z_3$ in terms of $z_2$, the three double-inequalities become:



for $w_1$:  $-.8431 \le 1.1284\, z_2 + .2853\, z_2^2 \le 1.4137$
for $w_2$:  $-.6928 \le .8541\, z_2 + .1612\, z_2^2 \le 1.0153$
for $w_3$:  $.0077 \le .0003\, z_2 + .3218\, z_2^2 \le .3221$



It is readily verified that of these six inequalities, all but
one are satisfied for all $z_2$ values between -1 and +1. The
last one, involving the left side of the third set, is satisfied
for all $z_2$ values except for the interval $-.156 < z_2 < .155$.
This corresponds to an $x_2$ interval between 2.1 and 2.8, or
between the design points $x_2 = 2.1$ and $x_2 = 3.6$ (see table
12). The interpretation of this finding is that while all design
points are of course inside the EPD, a small portion of the
curve $x_3$ versus $x_2$ falls slightly outside the EPD. This is of
no practical significance since the VF for these points, even
though they are outside the EPD, does not exceed 0.58. By
comparison, the smallest VF value along the curve, for the
range $x_2 = .2$ to $x_2 = 4.7$, is of the order of 0.26. Thus we see
that the serious collinearity in this data set is merely a
consequence of the presence of the expansion term $x_3 = x_2^2$.
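The verification described above can be reproduced numerically. The following is a minimal sketch, assuming that the Range-Midrange Transformation has the form z = (x - C)/R, with C the midrange and R the half-range of each regressor; this assumption is not restated in this section, but it reproduces the coefficients quoted in the three double-inequalities. The helper names (`midrange_and_halfrange`, `w_value`) are illustrative only.

```python
# Table 12 design values; x3 is the expansion term x2**2.
x2 = [0.2, 0.4, 1.0, 2.1, 3.6, 4.7]
x3 = [v * v for v in x2]

def midrange_and_halfrange(values):
    """Return (C, R) so that z = (x - C) / R lies in [-1, 1] (assumed form of RMT)."""
    lo, hi = min(values), max(values)
    return (hi + lo) / 2.0, (hi - lo) / 2.0

C2, R2 = midrange_and_halfrange(x2)
C3, R3 = midrange_and_halfrange(x3)

# Eq (24): z3 = a0 + a1*z2 + a2*z2**2
a0 = (C2 ** 2 - C3) / R3
a1 = 2.0 * C2 * R2 / R3
a2 = R2 ** 2 / R3

# Table 13 rows: (lower bound, coefficient of z2, coefficient of z3, upper bound).
table13 = [
    (-1.1284, 0.5070, 0.6214, 1.1284),   # w1
    (-0.8541, 0.5029, 0.3511, 0.8541),   # w2
    (-0.3141, -0.7005, 0.7008, 0.0003),  # w3
]

def w_value(row, z2):
    """Value of a w-coordinate on the curve x3 = x2**2, as a function of z2."""
    _, b2, b3, _ = row
    z3 = a0 + a1 * z2 + a2 * z2 * z2
    return b2 * z2 + b3 * z3

tol = 1e-9  # guards against floating-point noise at the end points z2 = -1, +1
grid = [k / 1000.0 for k in range(-1000, 1001)]
violations = {}
for idx, row in enumerate(table13, start=1):
    lo, _, _, hi = row
    violations[idx] = [z for z in grid if not (lo - tol <= w_value(row, z) <= hi + tol)]

for idx, bad in violations.items():
    if bad:
        # Convert the violated z2 interval back to the x2 scale.
        print(f"w{idx}: violated for z2 in [{min(bad)}, {max(bad)}], "
              f"x2 in [{C2 + R2 * min(bad):.2f}, {C2 + R2 * max(bad):.2f}]")
    else:
        print(f"w{idx}: satisfied for all z2 in [-1, 1]")
```

Scanning a grid rather than solving the quadratics keeps the sketch short; the end points of the reported interval are therefore accurate only to the grid spacing of 0.001.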



Table 12. An artificial quadratic example¹

Point    $x_2$    $x_3$      y
  1       .2       .04     28.3
  2       .4       .16     27.5
  3      1.0      1.00     25.6
  4      2.1      4.41     28.7
  5      3.6     12.96     46.4
  6      4.7     22.09     69.8

¹ $y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$; $\beta_1 = 30$, $\beta_2 = -8$, $\beta_3 = 3.5$, $\sigma_\varepsilon = 0.2$, $x_1 = 1$.



Table 13. Quadratic example: inequalities for EPD.

W-coordinate    Inequalities
$w_1$           $-1.1284 \le .5070\, z_2 + .6214\, z_3 \le 1.1284$
$w_2$           $-.8541 \le .5029\, z_2 + .3511\, z_3 \le .8541$
$w_3$           $-.3141 \le -.7005\, z_2 + .7008\, z_3 \le .0003$



Any point in X space, in order to be acceptable, must lie on
the curve $x_3 = x_2^2$. An $x_3$ with any other value is obviously not
valid, and our analysis of the data, through the EPD, calls
attention to this fact: in the direction of $w_3$, the width of the
EPD is only .31 as compared with widths of 2.26 and 1.71
for $w_1$ and $w_2$.

Discussion 

The common mathematical definition of collinearity is
the existence of at least one linear relation between the x's,
of the form

$$ \sum_{j=1}^{p} c_j x_{ij} = 0 \tag{25} $$

where the $c_j$ are not all zero, and such that eq (25) holds with
the same $c_j$ values, for all $i$. This defines what we shall call
"exact collinearity." Geometrically, it means that all design
points lie in a hyperplane of the x-space, going through the
origin of the coordinate system. Equation (25) also implies
that the matrix $X'X$ is singular, and consequently that the
estimates of the $\beta$ coefficients are not uniquely defined.
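A small numerical illustration of exact collinearity may help here. The sketch below uses a hypothetical five-point design (not taken from the paper) in which the third regressor is an exact linear function of the first two, and checks the two consequences stated above: the design matrix loses rank, and two different coefficient vectors give identical fitted values.

```python
import numpy as np

# Hypothetical design: the third column is an exact linear function of the first two.
x1 = np.ones(5)
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x3 = 2.0 * x1 + 3.0 * x2
X = np.column_stack([x1, x2, x3])

# Coefficients c_j of eq (25): 2*x1 + 3*x2 - x3 = 0 for every row i.
c = np.array([2.0, 3.0, -1.0])
relation = X @ c

# X has rank p - 1 = 2, so X'X (which has the same rank) is singular.
rank = np.linalg.matrix_rank(X, tol=1e-8)

# Non-uniqueness: beta and beta + t*c give the same fitted values for any t,
# so least squares cannot distinguish between them.
beta = np.array([1.0, 0.5, -0.25])
fitted_a = X @ beta
fitted_b = X @ (beta + 4.0 * c)

print("relation holds for all i:", np.allclose(relation, 0.0))
print("rank of X:", rank)
print("identical fitted values:", np.allclose(fitted_a, fitted_b))
```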

Exact collinearity seldom occurs in real experimental sit- 
uations; indeed, if the X matrix is not the result of a designed 
experiment, it is highly improbable that a relation such as eq 
(25) would hold exactly. If, on the other hand, the experiment
is designed, care would generally have been taken to
avoid a situation of exact collinearity. 

While exact collinearity is practically of little concern, 
near-collinearity is a frequent occurrence in real-life data. 
This occurs when an equation such as (25) is "approx- 
imately" true for all i . Many attempts have been made to 
define more closely the concept of near-collinearity, but 
while these endeavors have led to a number of proposals for 
measuring collinearity, they are of little practical use to the 
experimenter confronted with the task of interpreting his 
data. 

It is not our intention to discuss here the pros and cons of 
the various attempts made by a number of authors to 
"remedy" a near-collinear situation. The best-known of 
these remedial procedures is Ridge Regression. We merely
repeat what we have said in the body of the paper: any 
attempt to remedy collinearity must necessarily be based on 
additional assumptions, unless it consists of making addi- 
tional measurements. The latter alternative is of course log- 
ical and valid, but the making of assumptions invented 
specifically for the purpose of removing collinearity does 
not appear to us to be a recommendable policy in data 
analysis. 

One easily recognizable condition leading to collinearity 
is the existence of at least one high correlation coefficient 
among the non-diagonal elements of the correlation matrix
of the x's. This has given rise to the concept of the Variance
Inflation Factor (VIF). The VIF for $\beta_j$ is defined (Draper
and Smith [5]) as:

$$ \mathrm{VIF}(\beta_j) = \frac{1}{1 - R_j^2} \tag{26} $$



where $R_j$ is the multiple correlation coefficient of $x_j$ on all
other regressors. If $d$ represents a residual in this regression,
the usual formula for $R_j$ is given by

$$ R_j^2 = 1 - \frac{\sum d^2}{\sum (x_j - \bar{x}_j)^2} \tag{27} $$
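Equations (26) and (27) translate directly into a short computation. The sketch below applies them to the regressors of table 12, regressing each of $x_2$ and $x_3$ on the remaining regressors (including the constant term) by least squares; the function name `vif_centered` is illustrative only.

```python
import numpy as np

# Regressors of table 12 (x1 = 1 is the constant term).
x2 = np.array([0.2, 0.4, 1.0, 2.1, 3.6, 4.7])
x3 = x2 ** 2

def vif_centered(xj, others):
    """VIF of xj from eqs (26)-(27); `others` must include the constant column."""
    coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
    d = xj - others @ coef                                     # residuals of xj on the other regressors
    r2 = 1.0 - np.sum(d ** 2) / np.sum((xj - xj.mean()) ** 2)  # eq (27)
    return 1.0 / (1.0 - r2)                                    # eq (26)

ones = np.ones_like(x2)
vif_x2 = vif_centered(x2, np.column_stack([ones, x3]))
vif_x3 = vif_centered(x3, np.column_stack([ones, x2]))
print(vif_x2, vif_x3)
```

With only two non-constant regressors, $R_j^2$ reduces to the squared simple correlation between $x_2$ and $x_3$, so the two VIF values coincide.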



Now, Snee and Marquardt (Belsley [6], "comments") make,
implicitly, a distinction between the two "models":

$$ y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \tag{28a} $$

with $x_1 = 1$, and

$$ y - \bar{y} = \beta_2 (x_2 - \bar{x}_2) + \cdots + \beta_p (x_p - \bar{x}_p) + \varepsilon \tag{28b} $$

where (28b) is called the "centered" model. For (28b), Snee
and Marquardt use eq (27), but for (28a) they appear to use
the definition:

$$ R_j^2 = 1 - \frac{\sum d^2}{\sum x_j^2} \tag{29} $$



Equation 29, in which the denominator of the last term is not
centered, is not explicitly given by Snee and Marquardt, but
is implied by their statement:

    "If the domain of prediction includes the full
    range from the natural origin through the range of
    the data, then collinearity diagnostics should not
    be mean-centered,"

and confirmed by the VIF values given in their table 1. In
this table, "no centering" results in VIF values of 200,000
and 400,000, while the VIF for the "centered" data are
unity. The quoted statement occurs in a section entitled
"Model building must consider the intended or implied
domain of prediction." The basic idea underlying the section
in question is that the analysis of the data, based on the
"collinearity diagnostics" (specifically: the VIF values), is
governed by the location of the points where one wishes to
make predictions and, more specifically, by whether the
origin ($x_1 = 1$, $x_2 = x_3 = 0$) is such a point. The VIF values
which, according to Snee and Marquardt's formulas, depend
heavily on whether or not this origin is included, will then
indicate the quality of the predicted values.

A more reasonable approach, and one more consistent with
the procedures commonly used by scientists, is to limit
prediction to the vicinity of where one made the measurements,
unless additional information is available that justifies
extrapolation of the regression equation to more distant
points of the sample space. The vicinity of the measured
points is determined by the EPD which, in the case of
collinearity, may be considerably smaller than the sample
domain. In this view, it is the location of the design points,
rather than that of the intended points of prediction, that
determines predictability. The latter is measured, not by
VIF values, but rather by the more concrete VF values, for
any desired point of prediction.

The view advocated by Snee and Marquardt sometimes
results in an enormous difference in the VIF values between
the centered and non-centered forms. Equation 29 serves no
useful purpose and is, in fact, unjustified and misleading. It
is unjustified because it not only includes the origin ($x_1 = 1$,
$x_k = 0$ for $k > 1$) in the correlation and VIF calculations, but
moreover, gives this point infinite weight in these calculations.
Yet, no measurement was made at that point. Equation
29 is also misleading because it leads to very large VIF
values for some non-centered regressions, implying that
severe "ill-conditioning" exists, even when the X matrix is,
except for some trivial coding, completely orthogonal (cf. [6]).

The ill-conditioning exists only in terms of the large VIF 
value. It is an artifact arising from the desire to make the two 
forms of the regression equation into two distinct "models". 

The two forms, eqs (28a) and (28b), lead to identical
estimates for the $\beta_j$, including $\beta_1$, and for their standard
errors. They also lead to identical values and variances for an
estimated (predicted) y, at any point of the X space. There
seems to be no valid reason for the two distinct equations for
the VIF. They only lead to the false impression that centering
can reduce or even remove collinearity.
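The contrast between eqs (27) and (29), and the equivalence of the two forms (28a) and (28b), can be illustrated on a small example. The sketch below uses a hypothetical 2 x 2 factorial design that is orthogonal after centering but located away from the origin; the response values are made up for illustration, and the variance factor of a predicted y is taken to be $x_0'(X'X)^{-1}x_0$, which is an assumption about the definition of VF given earlier in the paper.

```python
import numpy as np

# Hypothetical 2x2 factorial: orthogonal after centering, but away from the origin.
x2 = 10.0 + np.array([-1.0, -1.0, 1.0, 1.0])
x3 = 20.0 + np.array([-1.0, 1.0, -1.0, 1.0])
y = np.array([3.1, 6.9, 7.2, 10.8])   # made-up responses, for illustration only
ones = np.ones_like(x2)

def vif(xj, others, centered):
    """VIF of xj using eq (27) if centered, eq (29) otherwise."""
    coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
    d = xj - others @ coef
    denom = np.sum((xj - xj.mean()) ** 2) if centered else np.sum(xj ** 2)
    return denom / np.sum(d ** 2)      # algebraically equal to 1 / (1 - R_j^2)

others_for_x2 = np.column_stack([ones, x3])
vif_centered = vif(x2, others_for_x2, centered=True)      # eq (27)
vif_uncentered = vif(x2, others_for_x2, centered=False)   # eq (29)

# Form (28a): constant term plus raw regressors.
Xa = np.column_stack([ones, x2, x3])
beta_a, *_ = np.linalg.lstsq(Xa, y, rcond=None)

# Form (28b): centered response and centered regressors, no constant term.
Xb = np.column_stack([x2 - x2.mean(), x3 - x3.mean()])
beta_b, *_ = np.linalg.lstsq(Xb, y - y.mean(), rcond=None)

# Variance factor of a predicted y at the same point, computed from each form.
x0 = np.array([1.0, 10.5, 19.5])
vf_a = x0 @ np.linalg.solve(Xa.T @ Xa, x0)
u0 = x0[1:] - np.array([x2.mean(), x3.mean()])
vf_b = 1.0 / len(y) + u0 @ np.linalg.solve(Xb.T @ Xb, u0)

print("VIF, eq (27):", vif_centered, " VIF, eq (29):", vif_uncentered)
print("slopes (28a):", beta_a[1:], " slopes (28b):", beta_b)
print("VF at x0:", vf_a, vf_b)
```

Moving the design farther from the origin inflates the eq (29) value without changing the slopes or the variance factor at any given point.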

Our viewpoint in this paper is that the usefulness of a
regression equation lies in its ability to "predict" y for
interesting combinations of the x's. We also take the position
that inferences from the data alone should be confined to x
points that are in the general geometric vicinity of the cluster
of design points. An inference for points that are well outside
this domain (i.e., outside a suitably defined EPD) is, in
the absence of additional information, only a tentative
conclusion, and not a valid scientific inference. Such conclusions
may, however, be very useful, provided their tentative
character is recognized, and provided they are subsequently
subjected to further experimental verification.

Daniel and Wood [7] discuss briefly the relation between
the variance of y and the location of the point at which the
prediction is made. However, their discussion is in the context
of selecting the best subset of regressors from among
the entire set of regressors, a subject different from the one
dealt with in this paper.

Another publication that deals explicitly with predictability
is a paper by Willan and Watts [8]. These authors define
a "Region of Effective Predictability" (REP_A) as that portion
of the X space in which the variance of the predicted y does
not exceed twice the variance of y predicted at the centroid
of the X matrix. The volume of the region is then compared
with that of a similarly defined REP, denoted REP. The
latter refers to a "fictitious orthogonal reference design" of
"orthogonal data with the same N and the same rms values
as the actual data." The ratio of the volume of REP_A to that
of REP is taken as "an overall measure of the loss of
predictability volume due to collinearity".

This concept, apart from its artificial character, suffers
from other shortcomings. Like so many other treatments, it
attempts to provide a measure of collinearity. But the
practitioner who is confronted with a collinear X matrix does not
need a measure of collinearity: he needs a way to use the
data for the purpose for which they were obtained. Furthermore,
this measure loses its meaning when expansion variables
are present. For example, for the artificial quadratic
set of table 12, Willan and Watts' measure would indicate
a high degree of collinearity which, while literally true, is
totally misleading since the collinearity in no way reduces
the usefulness and predicting power of the regression equation,
as long as the meaning of the expansion term is taken
into account. But even in cases without expansion terms, the
measure in question may be misleading. Thus when applied
to the protein calibration data of table 5, it may well lead the
analyst to give up on these data as a hopelessly highly-collinear
set, whereas, as we have seen, there is nothing
wrong with this set and it can indeed be used very effectively
for the calibration of a method for protein determination
based on reflectance measurements.

Finally, a few words about estimating the $\beta$-coefficients
considered as rates of change of y with changes in the
individual $x_j$. As pointed out by Box [9], this is generally
not a desirable use of regression equations. If, however, it
is the major purpose of a particular experiment, then this
experiment should be designed accordingly, which means
essentially with an orthogonal X matrix. A collinear X matrix
leads to the ability to estimate certain linear combinations
of the $\beta$'s much better than the $\beta$'s themselves. The
experimenter can calculate the VF values, not only for any
point of X space, but also for any $\beta$ or combination of $\beta$'s,
and he can do this without making a single measurement,
i.e., in the planning stages of the experiment. If the
experimenter does not take advantage of this opportunity, he may
be in for considerable disappointment, after having spent
time, money, and effort on inadequate experimentation. We
believe that the advocacy of remedial techniques, such as
Ridge Regression, for collinear data is unwise. One of the
most important tasks of a data analyst is to detect, and to call
attention to, limitations in the use and interpretation of the
data.
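The planning-stage calculation mentioned above needs only the design matrix. The following is a minimal sketch for the design of table 12, assuming that VF denotes the factor multiplying $\sigma^2$ in the variance of an estimate, i.e. $c'(X'X)^{-1}c$ for a linear combination $c'\beta$ (with $c = x_0$ for a predicted y at the point $x_0$). The function names are illustrative only.

```python
import numpy as np

# Design of table 12: only the x values are needed, no y measurements.
x2 = np.array([0.2, 0.4, 1.0, 2.1, 3.6, 4.7])
X = np.column_stack([np.ones_like(x2), x2, x2 ** 2])
XtX_inv = np.linalg.inv(X.T @ X)

def vf(c):
    """Variance factor of the linear combination c'beta (variance = VF * sigma^2)."""
    c = np.asarray(c, dtype=float)
    return float(c @ XtX_inv @ c)

def vf_point(x):
    """VF of the predicted y at a point on the curve x3 = x2**2."""
    return vf([1.0, x, x * x])

vf_design = [vf_point(x) for x in x2]   # VF at the six design points
vf_beta = [vf(e) for e in np.eye(3)]    # VF of each individual coefficient

# VF along the curve over the sampled range of x2.
grid = np.linspace(0.2, 4.7, 451)
vf_curve = np.array([vf_point(x) for x in grid])
inside = (grid >= 2.1) & (grid <= 2.8)

# Compare with the values of about 0.26 and 0.58 quoted for table 12.
print("smallest VF on the curve:", vf_curve.min())
print("largest VF for x2 in [2.1, 2.8]:", vf_curve[inside].max())
print("VF of individual coefficients:", vf_beta)
```

A useful check on such a calculation is that the VF values at the design points sum to the number of coefficients, here 3.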